Chemical labeling using tandem mass tag (TMT) has been commonly applied in mass spectrometry (MS)-based quantification of proteins and peptides. The proteoQ tool is designed for automated and reproducible analysis of proteomics data. It interacts with an Excel spread sheet for dynamic sample selections, aesthetics controls and statistical modelings. It further integrates the operations against data rows and columns into functions at the users’ interface. The arrangements allow users to put ad hoc manipulation of data behind the scene and instead apply metadata to openly address biological questions using various data preprocessing and informatic tools. In addition, the entire workflow is documented and can be conveniently reproduced upon revisiting.
The tool currently processes the peptide spectrum matches (PSM) tables from Mascot, MaxQuant and Spectrum Mill searches, for 6-, 10- or 11-plex TMT experiments using Thermo’s Orbitrap mass analyzers. Peptide and protein results are then produced with users’ selection of parameters in data filtration, alignment and normalization. The package further offers a suite of tools and functionalities in statistics, informatics and data visualization by creating ‘wrappers’ around published R routines.
(Click here to render a html version of the README. Mathematical symbols may display more properly under html.)
To install this package, start R (version “3.6.2”) as administrator and enter:
In this section I (Qiang Zhang) illustrate the following applications of proteoQ:
The data set we will use in this section corresponds to the proteomics data from Mertins et al. (2018). In the study, two different breast cancer subtypes, triple negative (WHIM2) and luminal (WHIM16), from patient-derived xenograft (PDX) models were assessed by three independent laboratories. At each site, lysates from WHIM2 and WHIM16 were each split and labeled with 10-plex TMT at equal sample sizes and repeated on a different day. This results in a total of 60 samples labeled under six 10-plex TMT experiments. The samples under each 10-plex TMT were fractionated by off-line, high pH reversed-phase (Hp-RP) chromatography, followed by LC/MS analysis. The MS data were analyzed against the search engines of Mascot, MaxQuant and Spectrum Mill. Ten percent of the PSM entries were sampled randomly from the complete data sets and stored in a companion package, proteoQDA.
We first install the data package, proteoQDA for exemplary fasta, PSM and metadata files:
RefSeq databases of human and mouse were used in the MS/MS searches against the WHIM data sets. To properly annotate protein entries with proteoQ, we would need the fasta file(s) that were used in the database searches. In the example below, we copy over the corresponding fasta files from the proteoQDA to a database folder:
The data processing begins with PSM table(s) from Mascot, MaxQuant or Spectrum Mill with the following compilation in file names:
F, followed by digits and ends with .csv;msms and end with .txt;PSMexport and end with .ssv.The corresponding PSMs are available through one of the followings copy_ utilities:
# Mascot
copy_global_mascot()
# or MaxQuant
copy_global_maxquant()
# or Spectrum Mill
copy_global_sm()To illustrate, I copy over Mascot PSMs to a working direcotry, dat_dir:
dat_dir <- "~\\proteoQ\\examples"
dir.create(dat_dir, recursive = TRUE, showWarnings = FALSE)
copy_global_mascot(dat_dir)When exporting Mascot PSMs, I typically set the option of Include sub-set protein hits to 0 with my opinionated choice in satisfying the principle of parsimony. Under Peptide Match Information, the options of Header and Peptide quantitation should be checked to include the search parameters and quantitative values. The inclusion of both Start and End is recommended and the file name(s) of the exports will be taken as is.
The same peptide sequence under different PSM files can be assigned to different protein IDs when inferring proteins from peptides using algorithms such as greedy set cover. To escape from the ambiguity in protein inference, I typically enable the option of Merge MS/MS files into single search in Mascot Daemon.1 If the option is disabled, peptide sequences that have been assigned to multiple protein IDs will be removed for now when constructing peptide reports.
The merged search may become increasingly cumbersome with growing data sets. In this example, I combined the MS peak lists from the Hp-RP fractions within the same 10-plex TMT experiment, but not the lists across experiments. This results in a total of six pieces of PSM results in Mascot exports.
The workflow involves an Excel template containing the metadata of multiplex experiments, including experiment numbers, TMT channels, LC/MS injection indices, sample IDs, reference channels, RAW MS data file names and additional fields from users. The default file name for the experimental summary is expt_smry.xlsx. If samples were fractionated off-line prior to LC/MS, a second Excel template will also be filled out to link multiple RAW MS file names that are associated to the same sample IDs. The default file name for the fractionation summary is frac_smry.xlsx.2 Unless otherwise mentioned, we will assume these default file names throughout the document.
Columns in the expt_smry.xlsx are approximately divided into the following three tiers: (1) essential, (2) optional default and (3) optional open. We supply the required information of the TMT experiments under the essential columns. The optional default columns serve as the fields for convenient lookups in sample selection, grouping, ordering, aesthetics etc. For instance, the program will by default look for values under the Color column if no instruction was given in the color coding of a PCA plot. The optional open fields on the other hand allow us to define our own analysis and aesthetics. For instance, we may openly define multiple columns of contrasts at different levels of granularity for uses in statistical modelings. Description of the column keys can be found from the help document by entering ?proteoQ::load_expts from a R console.
We next copy over a pre-compiled expt_smry.xlsx and a frac_smry.xlsx to the working directory:
We now have all the pieces that are required by proteoQ in place. Let’s have a quick glance at the expt_smry.xlsx file. We note that no reference channels were indicated under the column Reference. With proteoQ, the log2FC of each species in a given sample is calculated either (a) in relative to the reference(s) within each multiplex TMT experiment or (b) to the mean of all samples in the same experiment if reference(s) are absent. Hence, the later approach will be employed to the exemplary data set that we are working with. In this special case, the mean(log2FC) for a given species in each TMT experiment is averaged from five WHIM2 and five WHIM16 aliquots, which are biologically equivalent across TMT experiments.
As a final step of the setup, we will load the experimental summary into a work space:
PSMs are MS/MS events that lead to peptide identication at certain confidence levels. The evidences in PSMs can then be summarised to peptide and protein findings using various descriptive statistics. In this section, we will apply proteoQ to summarise PSM data into peptide and protein reports.
We start the section by processing the PSM files exported directly from Mascot searches:
# columns keys in PSM files suitable for varargs of `filter_`
normPSM(
group_psm_by = pep_seq_mod,
group_pep_by = gene,
fasta = c("~\\proteoQ\\dbs\\fasta\\refseq\\refseq_hs_2013_07.fasta",
"~\\proteoQ\\dbs\\fasta\\refseq\\refseq_mm_2013_07.fasta"),
rptr_intco = 1000,
rm_craps = TRUE,
rm_krts = FALSE,
rm_outliers = FALSE,
annot_kinases = TRUE,
plot_rptr_int = TRUE,
plot_log2FC_cv = TRUE,
filter_psms = exprs(pep_expect <= .1, pep_score >= 15),
filter_more_psms = exprs(pep_rank == 1),
)Note that at present the log2FC of PSMs are always aligned by median centering across samples.3 At group_psm_by = pep_seq, PSM entries with the same primary peptide sequence but different variable modifications will be grouped for analysis using descriptive statistics. In case group_psm_by = pep_seq_mod, PSMs will be grouped alternatively according to the unique combination of the primary sequences and the variable modifications of peptides. Analogously, group_pep_by specify the grouping of peptides by either protein accession names or gene names. The fasta argument points to the location of a copy of the RefSeq fasta files that were used in the corresponding MS/MS searches. Additional options include rm_craps, rm_krts, annot_kinases et al. More description of normPSM can be found by accessing its help document via ?normPSM.
Every time the normPSM module is executed, it will process the PSM data from the ground up. In other words, it has no memory on prior happenings. For instance, after inspecting graphically the intensity distributions at plot_rptr_int = TRUE, we may consider a new cut-off at rptr_intco = 0. The downward in rptr_intco is not going to cause information loss. This is trivia but worth mentioning here. As we will find out in following sections, utilities in peptide and protein normalization, standPep and standPrn, do pass information onto successive iterations.
For experiments that are proximate in the quantities of input materials, there might still be unprecedented events that could have caused dipping in the ranges of reporter-ion intensity for certain samples. With proper justication, we might consider excluding the outlier samples from further analysis. The sample removal and PSM re-processing can be achieved by simply deleting the corresponding entries under the column Sample_ID in expt_smry.xlsx, followed by the re-execution of normPSM().
There is a subtle problem when we choose to remove PSM outliers at rm_outliers = TRUE. Note that PSM outliers will be assessed at a per-peptide-and-per-sample basis, which can be a slow process for large data sets. To circumvent repeated efforts in finding PSM outliers, we may initially set rm_outliers = FALSE and plot_rptr_int = TRUE when executing normPSM(). This will allow us to first decide on an ultimate threshold of reporter-ion intensity, before proceeding to the more time-consuming procedure in PSM outlier removals.
The normPSM function can take additional, user-defined arguments of dot-dot-dot (see Wickham 2019, ch. 6) for the row filtration of data using logical conditions. In the above example, we have limited ourselves to PSM entries with pep_expect <= 0.1 and pep_score >= 15 by supplying the variable argument (vararg) of filter_psms_at. We further filtered the data at pep_rank == 1 with another vararg of filter_psms_more. It makes no difference whether we put the conditions in one or multiple statements:
The creation and assignment of varargs need to follow a format of filter_blahblah = exprs(cdn1, cdn2, ..., cond_last). Note that the names of varargs on the lhs start with the character string of filter_ to indicate the task of data filtration. On the rhs, pep_expect, pep_score and pep_rank are column keys that can be found from the Mascot PSM data. Backticks will be needed for column keys containing white space(s) and/or special character(s): `key with space (sample id in parenthesis)`. Analogously, we can apply the vararg approach to MaxQuant and Spectrum Mill PSMs:
# `PEP` and `Mass analyzer` are column keys in MaxQuant PSM tables
normPSM(
filter_psms_at = exprs(PEP <= 0.1, `Mass analyzer` == "FTMS"),
...,
)
# `score` is a column key in Spectrum Mill PSM tables
normPSM(
filter_psms_at = exprs(score >= 10),
...,
)I am new to R. It looks like that canonical R does not support the straight assignment of logical expressions to function arguments. To get around this, I took advantage of the facility of non-standard evaluation in rlang package in that the logical conditions are supplied within the round parenthesis after exprs. Next, the proteoQ program will obtain the expression(s) on the rhs of each vararg statment by performing a bare evaluation using rlang::eval_bare. Following that, a tidy evaluation by rlang::eval_tidy will be coupled to a local facility in proteoQ to do the real work of data filtrations ((see Wickham 2019, ch. 20)).
The approach of data filtration taken by normPSM might at first looks strange; however, it allows me to perform data filtration in a integrated way. As mentioned in the beginning, a central theme of proteoQ is to reduce or avoid direct data manipulations but utilizes metadata to control both data columns and rows. With the self-containedness in data filtration (and data ordering later), I can readily recall and reproduce what I had done when revisiting the system after an extended peroid. Otherwise, I would likely need ad hoc operations by mouse clicks or writing ephemeral R scripts, and soon forget what I have done.
Moreover, the build-in approach can serve as building blocks for more complex data processing. As shown in the help documents via ?standPep and ?standPrn, we can readily perform mixed-bed normalization by sample groups, against either full or partial data.
With normPSM, we can pretty much filter_ data under any PSM columns we like. In the above Mascot example, I have chosen to filter PSM entires by their pep_expect, pep_score etc. There is a reason for this.
Let’s first consider a different column pep_len. The values underneath are unique to both PSMs and peptides. As you might courteously agree, its time has not yet come in terms of tentative data filtration by peptide length. In other words, we can delay the filtration of peptide entries by their sequence lengths when we are actually working with peptide data. The summarization of PSMs to peptides is not going to change the number of amino acid residues in peptides. By contrast, the data under pep_expect are unique to PSMs, but not necessary to peptides. This is obvious in that each of the PSM events of the same peptide is likely to have its own confidence expectation in peptide identification. Therefore, if we were to filter data by their pep_expect values at a later stage of analysis, we would have lost the authentic information in pep_expect for peptides with mulitple PSM identifications. More specifically, the values under pep_expect in peptide tables are the geometric-mean representation of PSM results (see also section 4).
For this reason, I named the varargs filter_psms_at and filter_psms_more in the above normPSM examples. This allows me to readily recall that I was filtering data based on criteria that are specific to PSMs.
Vararg statements of filter_ and arrange_ are available in proteoQ for flexible filtration and ordering of data rows. To take advantage of the feature, we need to be aware of the column keys in input files. As indicated by their names, filter_ and filter2_ perform row filtration against column keys from a primary data file, df, and secondary data file(s), df2, respectively. The same correspondance is applicable for arrange_ and arrange2_ varargs.
Users will typically employ either primary or secondary vararg statements, but not both. In the more extreme case of gspaMap(...), it links prnGSPA(...) findings in df2 to the significance p-values and abundance fold changes in df for volcano plot visualizaitons by gene sets. The table below summarizes the df and the df2 for varargs in proteoQ.
| Utility | Primary | Secondary |
|---|---|---|
| normPSM | filter_: Mascot, F[…].csv; MaxQuant, msms[…].txt; SM, PSMexport[…].ssv | NA |
| PSM2Pep | NA | NA |
| mergePep | filter_: TMTset1_LCMSinj1_Peptide_N.txt | NA |
| standPep | slice_: Peptide.txt | NA |
| Pep2Prn | filter_: Peptide.txt | NA |
| standPrn | slice_: Protein.txt | NA |
| pepHist | filter_: Peptide.txt | NA |
| prnHist | filter_: Protein.txt | NA |
| pepSig | filter_: Peptide[_impNA].txt | NA |
| prnSig | filter_: Protein[_impNA].txt | NA |
| pepMDS | filter_: Peptide[_impNA][_pVal].txt | NA |
| prnMDS | filter_: Protein[_impNA][_pVal].txt | NA |
| pepPCA | filter_: Peptide[_impNA][_pVal].txt | NA |
| prnPCA | filter_: Protein[_impNA][_pVal].txt | NA |
| pepEucDist | filter_: Peptide[_impNA][_pVal].txt | NA |
| prnEucDist | filter_: Protein[_impNA][_pVal].txt | NA |
| pepCorr_logFC | filter_: Peptide[_impNA][_pVal].txt | NA |
| prnCorr_logFC | filter_: Protein[_impNA][_pVal].txt | NA |
| pepHM | filter_, arrange_: Peptide[_impNA][_pVal].txt | NA |
| prnHM | filter_, arrange_: Protein[_impNA][_pVal].txt | NA |
| anal_prnTrend | filter_: Protein[_impNA][_pVal].txt | NA |
| plot_prnTrend | NA | filter2_: […]Protein_Trend_{NZ}[_impNA][…].txt |
| anal_pepNMF | filter_: Peptide[_impNA][_pVal].txt | NA |
| anal_prnNMF | filter_: Protein[_impNA][_pVal].txt | NA |
| plot_pepNMFCon | NA | filter2_: […]Peptide_NMF[…]_consensus.txt |
| plot_prnNMFCon | NA | filter2_: […]Protein_NMF[…]_consensus.txt |
| plot_pepNMFCoef | NA | filter2_: […]Peptide_NMF[…]_coef.txt |
| plot_prnNMFCoef | NA | filter2_: […]Protein_NMF[…]_coef.txt |
| plot_metaNMF | filter_, arrange_: Protein[_impNA][_pVal].txt | NA |
| prnGSPA | filter_: Protein[_impNA]_pVals.txt | NA |
| prnGSPAHM | NA | filter2_: […]Protein_GSPA_{NZ}[_impNA]_essmap.txt |
| gspaMap | filter_: Protein[_impNA]_pVal.txt | filter2_: […]Protein_GSPA_{NZ}[_impNA].txt |
| anal_prnString | filter_: Protein[_impNA][_pVals].txt | NA |
To finish our discussion of PSM processing, let us consider having one more bash in data cleanup. The corresponding utility is purgePSM. It performs data purging by the CV of peptides, measured from contributing PSMs within the same sample. Namely, quantitations that have yielded peptide CV greater than a user-supplied cut-off will be replaced with NA.
The purgePSM utility reads files \PSM\TMTset1_LCMSinj1_PSM_N.txt, TMTset1_LCMSinj2_PSM_N.txt etc. from a preceding step of normPSM. To revert programmatically the changes made by purgePSM, we would need to start over with normPSM. Alternatively, we may make a temporary copy of these files for a probable undo.
This process takes place sample (column)-wisely while holding the places for data points that have been nullified. It is different to the above row filtration processes by filter_ in that there is no row removals with purging, not until all-NA rows are encountered.
Earlier in section 1.2.1, we have set plot_log2FC_cv = TRUE by default when calling normPSM. This will plot the distributions of the CV of peptide log2FC. In the event of plot_log2FC_cv = FALSE, we can have a second chance in visualzing the distributions of peptide CV before any permanent data nullification:
Taking the sample entries under TMT_Set one and LCMS_Injection one in label_scheme.xlsx as an example, we can see that a small portion of peptides have CV greater than 0.5 at log2 scale (Figure 1A).
Figure 1A-1C. CV of peptide log2FC (based on full data set). Left: no CV cut-off; middle: CV cut-off at 0.5; right: CV cut-off at 95 percentile.
Quantitative differences greater than 0.5 at a log2 scale is relatively large in TMT experiments,4 which can be in part ascribed to a phenomenum called peptide co-isolation and co-fragmentation in reporter ion-based MS experiments. We might, for instance, perform an additional cleanup by removing column-wisely data points with CV greater than 0.5 (Figure 1B):
The above method using a flat cut-off would probably fall short if the ranges of CV are considerably different across samples (see Lab 3.1). Alternatively, we can remove low-quality data points using a CV percentile, let’s say at 95%, for each sample (Figure 1C):
# copy back `\PSM\TMTset1_LCMSinj1_PSM_N.txt` etc. before proceed
# otherwise the net effect will be additive to the prior(s)
purgePSM (
pt_cv = 0.95,
)In the event of both pt_cv and max_cv being applied to nullify data, they follow the precedence of pt_cv > max_cv. When needed, we can overrule the default by executing purgePSM sequentially at a custom order:
# at first no worse than 0.5
purgePSM (
max_cv = 0.5,
)
# next `pt_cv` on top of `max_cv`
purgePSM (
pt_cv = 0.95,
)The data purge is also additive w.r.t. to repetative analysis. In the following example, we are actually perform data cleanup at a CV threshold of 90%:
While multiple PSMs carry information about the precision in peptide measures, the above single-sample variance does not inform sampling errors prior to peptide separations. For instance, the same peptide species from a given sample remain indistinguishable/exchangeable prior to the off-line fractionation. As a result, the CV shown by normPSM or purgePSM mainly tell us the uncertainty of measures beyond the point of peptide parting.
NB: CV is sensitive to outliers and some large CV in peptide quantitations may be merely due to a small number of bad measures. Although the option of rm_outliers was set to FALSE during our earlier call to normPSM, I think it is generally a good idea to have rm_outliers = TRUE.
In this section, we summarise the PSM results to peptides with PSM2Pep, mergePep, standPep and optional purgePep.
The utility for the summary of PSMs to peptides is PSM2Pep:
It loads the PSM tables from the preceding normPSM procedure and summarize them to peptide data using various descriptive statistics (see also Section 4). For intensity and log2FC data, the summarization method is specified by argument method_psm_pep, with median being the default.
Following the summarization of PSMs to peptides, the utility mergePep will assemble individual peptide tables, Peptide\TMTset1_LCMSinj1_Peptide_N.txt, TMTset1_LCMSinj2_Peptide_N.txt etc., into one larger piece, Peptide.txt.
Similar to normPSM, we can filter data via column keys linked to the varargs of filter_. In the exemplary vararg statement of filter_peps_by, we exlcude longer peptide sequences with more than 100 amino acid residues. If we are interested in human, but not mouse, peptides from the pdx samples, we can specify similarly that species == "human". Sometimes, it may remain unclear on proper data filtration at the early stage of analysis. In that case, we may need additional quality assessments that we will soon explore. Alternatively, we may keep as much information as possible and apply varargs in downstream analysis.
Note that pep_len is a column key in TMTset1_LCMSinj1_Peptide_N.txt with Mascot workflows. Depends on the search engines, we might need to employ different key names for the same purpose:
The utility standPep standardizes peptide results from mergePep with additional choices in data alignment.
standPep(
range_log2r = c(10, 90),
range_int = c(5, 95),
method_align = MGKernel,
n_comp = 3,
seed = 749662,
maxit = 200,
epsilon = 1e-05,
)The parameters range_log2r and range_int outline the ranges of peptide log2FC and reporter-ion intensity, respectively, for use in defining the CV and scaling the log2FC across samples. The log2FC of peptide data will be aligned by median centering across samples by default. If method_align = MGKernel is chosen, log2FC will be aligned under the assumption of multiple Gaussian kernels.5 The companion parameter n_comp defines the number of Gaussian kernels and seed set a seed for reproducible fittings. Additional parameters, such as, maxit and epsilon, are defined in and for use with normalmixEM.
It is also feasible to perform standPep aganist defined sample columns and data rows. Moreover, the utility can be applied iteratively with cummulative effects. Combinations and iterations of the features can lead to specialty sample alignments that will discuss soon (sections 1.3.5 - 1.3.7). Before delving more into the details, we would probably need some helps from the pepHist utility in the immediately following.
The pepHist utility plots the histograms of peptide log2FC. It further bins the data by their contributing reporter-ion intensity. In the examples shown below, we compare the log2FC profiles of peptides with and without scaling normalization:6
# without scaling
pepHist(
scale_log2r = FALSE,
ncol = 10,
)
# with scaling
pepHist(
scale_log2r = TRUE,
ncol = 10,
)By default, the above calls of pepHist will look for none void entries under column Select in expt_smry.xlsx. This will results in histogram plots with 60 panels in each, which may not be easy to explore as a whole. In stead, we will break the plots down by their data origins. We begin with modifying the expt_smry.xlsx file by adding the columns BI_1, JHU_1 etc. Each of the new columns includes sample entries that are tied to their laboratory origins and TMT batches (the columns are actually already in the expt_smry.xlsx).
We now are ready to plot histograms for each subset of the data. In this document, we only display the plots using the BI_1 subset:
# without scaling
pepHist(
scale_log2r = FALSE,
col_select = BI_1,
ncol = 5,
filename = bi1_n.png,
)
# with scaling
pepHist(
scale_log2r = TRUE,
col_select = BI_1,
ncol = 5,
filename = bi1_z.png,
)NB: We interactively told pepHist() that we are interested in sample entries under the newly created BI_1 column. Behind the scene, the interactions are facilitated by openxlsx via the reading of the Setup workbook in expt_smry.xlsx. We also supply a file name, assuming that we want to keep the earlierly generated plots with default file names of Peptide_Histogram_N.png and Peptide_Histogram_Z.png.
Figure 2A-2B. Histograms of peptide log2FC. Top: scale_log2r = FALSE; bottom, scale_log2r = TRUE
As expected, both the widths and the heights of log2FC profiles become more comparable after the scaling normalization. However, such adjustment may cause artifacts when the standard deviaiton across samples are genuinely different. I typically test scale_log2r at both TRUE and FALSE, then make a choice in data scaling together with my a priori knowledge of the characteristics of both samples and references.7 We will use the same data set to illustrate the impacts of reference selections in scaling normalization in Lab 3.1.
It should also be noted that the curves of Gaussian density in histograms are calculated during the latest call to standPep(...) with the option of method_align = MGKernel. There is a useful side effect when comparing leading and lagging profiles of log2FC. In the following barebone example, we align differently the peptide log2FC with the default method of median centering:
We then visuzlize the histograms of the ratio profiles (Figure 2C):
Within this document, the preceding example that involves standPep(...) at method_align = MGKernel is given in section 1.3.3. In this case, a comparison between the present and the prior histograms will reveal the difference in ratio alignments between a median centering and a three-Gaussian assumption. More examples in the side effects can be found from the help document via ?standPep and ?pepHist.
Figure 2C-2D. Histograms of peptide log2FC. Top: median-centering for all samples; bottom: W2.BI.TR2.TMT1 aligned differently by Gaussian density
The varargs of filter_ are also available in the pepHist utility. With the following examples, we can visualize the peptide log2FC with human and mouse origins, respectively:
Now that we have been acquainted with pepHist, let’s revisit and explore additionally standPep with its features in normalizing data against defined sample columns (and data rows in the following sections).
Needs in data (re)normalization may be encountered more often than not. One of the trivial circumstances is that a multi-Gaussian kernel can fail capturing the log2FC profiles for a subset of samples. This is less an issue with a small number of samples. Using a trial-and-error approach, we can start over with a new combination of parameters, such as a different seed, and/or a different range of range_log2r etc. However, the one-size-fit-all attempt may remain inadequate when the number of samples is relatively large. The proteoQ allows users to focus fit aganist selected samples. This is again the job of argument col_select. Let’s say we want to re-fit the log2FC for samples W2.BI.TR2.TMT1 and W2.BI.TR2.TMT2. We simply add a column, which I named it Select_sub, to expt_smry.xlsx with the sample entries for re-fit being indicated under the column:
We may then execute the following codes with argument col_select being linked to the newly created column:
standPep(
method_align = MGKernel,
range_log2r = c(10, 90),
range_int = c(5, 95),
n_comp = 3,
seed = 749662,
maxit = 200,
epsilon = 1e-05,
col_select = Select_sub,
)
pepHist(
scale_log2r = TRUE,
col_select = BI_1,
ncol = 5,
filename = mixed_bed_3.png,
)In the preceding execution of barebone standPep(), samples were aligned by median centering (Figure 2C). As expected, the current partial re-normalization only affects samples W2.BI.TR2.TMT1 and W2.BI.TR2.TMT2 (Figure 2D, W2.BI.TR2.TMT2 not shown). In other words, samples W2.BI.TR2.TMT1 and W2.BI.TR2.TMT2 are now aligned by their Gaussian densities whereas the remaining are by median centering. The combination allows us to align sample by mixed-bedding the MC or the MGKernel method.
We have earlierly applied the varargs of filter_ in normPSM and mergePep to subset data rows. With this type of arguments, data entries that have failed the filtration criteria will be removed for indicated analysis.
Similarly, we employed the filter_ varargs in pepHist to subset peptides with human or mouse origins (section 1.3.4.3). This is often not an issue in informatic analysis and visualization, as we do not typically store the altered inputs on external devices at the end. Sometimes we may however need to carry out similar tasks based on partial inputs and update the complete set of data for future uses. One of the circumstances is model parameterization by a data subset and to apply the finding(s) to update the complete set.
The standPep utility accepts variable arguments of slice_. The vararg statement(s) identify a subset of data rows from the Peptide.txt. The partial data will be taken for parameterizing the alignment of log2FC across samples. In the hypothetical example shown below, we normalize peptide data based peptide entries with sequence lengths greater than 10 and smaller than 30. The full data set will be updated accordingly with the newly derived parameters. Different to the filter_ varargs, there is no data entry removals from the complete data set with the slice_ procedure.
## DO NOT RUN
standPep(
...,
slice_peps_by = exprs(pep_len > 10, pep_len < 30),
)
## END of DO NOT RUNThe varargs are termed slice_ to make distinction to filter_. Although it might at first seem a little involved, the underlying mechanism is simple: col_select defines the sample columns and slice_ defines the data rows in Peptide.txt; and only the intersecting area between columns and rows will be subject additively to the parameterization in data alignment. The same pattern will be applied every time we invoke standPep.
Just like col_select and filter_ in pepHist, the combination in fixed argument col_select and variable argument slice_ can lead to features in versatile data processing. Several working examples are detailed and can be accessed via ?standPep and ?standPep.8
Now it becomes elementary if we were to normalize data against housekeeping protein(s). Let’s say we have GAPDH in mind as a housekeeping invariant among the proteomes, and of course we have good accuracy in their log2FC. We simply slice the peptide entries under GAPDH out for use as a normalizer:
standPep(
method_align = MC,
range_log2r = c(10, 90),
range_int = c(5, 95),
col_select = Select_sub,
slice_hskp = exprs(gene %in% c("GAPDH")),
)
pepHist(
scale_log2r = TRUE,
col_select = BI_1,
ncol = 5,
filename = housekeepers.png,
)Note that I chose method_align = MC in the above. There are only a few rows available for the samples linked to col_select, after slicing out GAPDH! The number of data points is too scare for fitting the selected samples against a 3-component Gaussian. A more detailed working example can also be found via ?standPep where you would probably agree that GAPDH is actually not a good normalizer for the data set.9
Analogously to the PSM processing, we may nullify data points of peptides by specifying a cut-off in their protein CVs:
# no purging
purgePep()
# or purge column-wisely by max CV
purgePep (
max_cv = 0.5,
filename = "by_maxcv.png",
)
# or purge column-wisely by CV percentile
# remember the additive effects
purgePep (
pt_cv = 0.5,
filename = "by_ptcv.png",
)NB: The above single-sample CVs of proteins are based on ascribing peptides, which thus do not inform the uncertainty in sample handling prior to the parting of protein entities, for example, the enzymatic breakdown of proteins in a typical MS-based proteomic workflow. On the other hand, the peptide log2FC have been previously summarized by the median statistics from contributing PSMs. Putting these two togother, the CV by purgePep describes approximately the uncentainty in sample handling from the breakdown of proteins to the off-line fractionation of peptides.
In this section, we summarise peptides to proteins, for example, using a two-component Gaussian kernel and customized filters.
The utility for the summary of peptides to proteins is Pep2Prn:
It loads the Peptide.txt and summarize the peptide data to interim protein results in Protein.txt, using various descriptive statistics (see also Section 4). For intensity and log2FC data, the summarization method is specified by argument method_pep_prn, with median being the default.
The utitily also accept varargs of filter_ for data row filtration against the column keys in Peptide.txt.
The utility standPrn standardizes protein results from Pep2Prn with additional choices in data alignment.
standPrn(
range_log2r = c(10, 90),
range_int = c(5, 95),
method_align = MGKernel,
n_comp = 2,
seed = 749662,
maxit = 200,
epsilon = 1e-05,
slice_prots_by = exprs(prot_n_pep >= 2),
)It loads Protein.txt from Pep2Prn or a preceding standPrn procedure and align protein data at users’ choices. The utility is analogous to standPep with choices in col_select and slice_. In the above example, the normalization is against proteins with two more identifying peptides. For helps, try ?standPrn.
Similar to the peptide summary, we can inspect the alignment and the scale of ratio profiles for protein data:
# without scaling
prnHist(
scale_log2r = FALSE,
col_select = BI_1,
ncol = 5,
filename = bi1_n.png,
)
# with scaling
prnHist(
scale_log2r = TRUE,
col_select = BI_1,
ncol = 5,
filename = bi1_z.png,
)For simplicity, we only display the histograms with scaling normalization (Figure 2E).
Figure 2E-2F. Histograms of protein log2FC at scale_log2r = TRUE. Left: before filtration; right, after filtration
In section 1.3.4.2, we used pepHist to illustrate the side effects in histogram visualization when toggling the alignment methods between MC and MGKernel. In the following, we will show another example of side effects using the protein data.
We prepare the ratio histograms for proteins with ten or more quantifying peptides:
# without scaling
prnHist(
scale_log2r = FALSE,
col_select = BI_1,
ncol = 5,
filter_prots_by = exprs(prot_n_pep >= 10),
filename = bi1_n_npep10.png,
)
# with scaling
prnHist(
scale_log2r = TRUE,
col_select = BI_1,
ncol = 5,
filter_prots_by = exprs(prot_n_pep >= 10),
filename = bi1_z_npep10.png,
)The density curves are based on the latest call to standPrn(...) with method_align = MGKernel (Figure 2E). For simplicity, we again only show the current plots at scale_log2_r = TRUE (Figure 2F). The comparison between the lead and the lag allows us to visualize the heteroscedasticity in data and in turn inform new parameters in data renormalization.
Up to this point, we might have reach a consensus on the choice of scaling normalization. If so, it may be plausible to set the value of scale_log2r under the Global environment, which is typically the R console that we are interacting with.
In this way, we can skip the repetitive setting of scale_log2r in our workflow from this point on, and more importantly, prevent ourselves from peppering the values of TRUE or FALSE in scale_log2r from analysis to analysis.
Scripts that were used in this document can be accessed via:
Another good place to get started is via the help ?load_expts. More workflow scripts are under construction.
In this section I illustrate the following applications of proteoQ:
Unless otherwise mentioned, the in-function filtration of data by varargs of filter_ is available throughout this section of informatic analysis. Row ordering of data, indicated by arrange_, is available for heat map applications using pepHM and prnHM.
We first visualize MDS, PCA and Euclidean distance against the peptide data. We start with metric MDS for peptide data:
Figure 3A. MDS of peptide log2FC at scale_log2r = TRUE
It is clear that the WHIM2 and WHIM16 samples are well separated by the Euclidean distance of log2FC (Figure 3A). We next take the JHU data subset as an example to explore batch effects in the proteomic sample handling:
# `JHU` subset
pepMDS(
col_select = JHU,
filename = jhu.png,
show_ids = FALSE,
height = 3,
width = 8,
)
Figure 3B-3C. MDS of peptide log2FC for the JHU subset. Left: original aesthetics; right, modefied aesthetics
We immediately spot that all samples are coded with the same color (Figure 3B). This is not a surprise as the values under column expt_smry.xlsx::Color are exclusively JHU for the JHU subset. For similar reasons, the two different batches of TMT1 and TMT2 are distinguished by transparency, which is governed by column expt_smry.xlsx::Alpha. We may wish to modify the aesthetics using different keys: e.g., color coding by WHIMs and size coding by batches, without the recourse of writing new R scripts. One solution is to link the attributes and sample IDs by creating additional columns in expt_smry.xlsx. In this example, we have had coincidentally prepared the column Shape and Alpha to code WHIMs and batches, respectively, for the JHU subset. Therefore, we can recycle them directly to make a new plot (Figure 3C):
# new `JHU` subset
pepMDS(
col_select = JHU,
col_fill = Shape, # WHIMs
col_size = Alpha, # batches
filename = new_jhu.png,
show_ids = FALSE,
height = 3,
width = 8,
)Accordingly, the prnMDS performs MDS for protein data. For PCA analysis, the corresponding functions are pepPCA and prnPCA for peptide and protein data, respectively.
While MDS approximates Euclidean distances at a low dimensional space. Sometimes it may be useful to have an accurate view of the distance matrix. Functions pepEucDist and prnEucDist plot the heat maps of Euclidean distance matrix for peptides and proteins, respectively. They are wrappers of pheatmap. Supposed that we are interested in visualizing the distance matrix for the JHU subset:
# `JHU` subset
pepEucDist(
col_select = JHU,
annot_cols = c("Shape", "Alpha"),
annot_colnames = c("WHIM", "Batch"),
# `pheatmap` parameters
display_numbers = TRUE,
number_color = "grey30",
number_format = "%.1f",
clustering_distance_rows = "euclidean",
clustering_distance_cols = "euclidean",
fontsize = 16,
fontsize_row = 20,
fontsize_col = 20,
fontsize_number = 8,
cluster_rows = TRUE,
show_rownames = TRUE,
show_colnames = TRUE,
border_color = "grey60",
cellwidth = 24,
cellheight = 24,
width = 14,
height = 12,
filename = jhu.png,
)Parameter annot_cols defines the tracks to be displayed on the top of distrance-matrix plots. In this example, we have choosen expt_smry.xlsx::Shape and expt_smry.xlsx::Alpha, which encodes the WHIM subtypes and the batch numbers, respectively. Parameter annot_colnames allows us to rename the tracks from Shape and Alpha to WHIM and Batch, respectively, for better intuition. We can alternatively add columns WHIM and Batch if we choose not to recycle and rename columns Shape and Alpha.
Figure 3D. EucDist of peptide log2FC at scale_log2r = TRUE
In this section, we visualize the batch effects and biological differences through correlation plots. The proteoQ tool currently limits itself to a maximum of 44 samples for a correlation plot. In the document, we will perform correlation analysis against the PNNL data subset. By default, samples will be arranged by the alphabetical order for entries under the column expt_smry.xlsx::Select. We have learned from the earlier MDS analysis that the batch effects are smaller than the differences between W2 and W16. We may wish to put the TMT1 and TMT2 groups adjacient to each other for visualization of more nuance batch effects, followed by the comparison of WHIM subtypes. We can achieve this by supervising sample IDs at a customized order. In the expt_smry.xlsx, We have prepared an Order column where samples within the JHU subset were arranged in the descending order of W2.TMT1, W2.TMT2, W16.TMT1 and W16.TMT2. Now we tell the program to look for the Order column for sample arrangement:
# peptide logFC
pepCorr_logFC(
col_select = PNNL,
col_order = Order,
filename = pep_pnnl.png,
)
# protein logFC
prnCorr_logFC(
col_select = PNNL,
col_order = Group,
filename = prn_pnnl.png,
)